Compact light field photography towards versatile three-dimensional vision

Inspired by natural living systems, modern cameras can attain three-dimensional vision via multi-view geometry like compound eyes in flies, or time-of-flight sensing like echolocation in bats. However, high-speed, accurate three-dimensional sensing capable of scaling over an extensive distance range and coping well with severe occlusions remains challenging. Here, we report compact light field photography for acquiring large-scale light fields with simple optics and a small number of sensors in arbitrary formats ranging from two-dimensional area to single-point detectors, culminating in a dense multi-view measurement with orders of magnitude lower dataload. We demonstrated compact light field photography for efficient multi-view acquisition of time-of-flight signals to enable snapshot three-dimensional imaging with an extended depth range and through severe scene occlusions. Moreover, we show how compact light field photography can exploit curved and disconnected surfaces for real-time non-line-of-sight 3D vision. Compact light field photography will broadly benefit high-speed 3D imaging and open up new avenues in various disciplines.

We thank the reviewer for the suggestion to clarify the difference between the proposed CLIP and previous compressive light field imaging methods. CLIP differs in both imaging models and implementation hardware. More fundamentally, it is a systematic method to design and transform any imaging model that employs nonlocal image acquisition into an efficient light field imaging method.
First, the perspective transform (and resultant imaging model) in compressive light field photography by Marwah et al. 3 is applied on the encoding mask and is static once the mask is fixed inside the camera, whereas the perspective transform in CLIP is applied on the sub-aperture images and needs to be numerically adjusted to change the reconstruction focus, similar to refocusing onto different depths in conventional light field cameras. Second, we showed in the revised Supplementary Note 5 of Supplementary Materials that existing compressive light field imaging methods are ill-suited to 0D, 1D, or sparse 2D detectors, while CLIP can accommodate detectors of arbitrary formats. Moreover, we demonstrate experimentally in the revised manuscript that CLIP can recover a 4D light field or directly reconstruct a refocused image from the same measurement data, and show in Supplementary Note 8 that the latter approach copes better with complex scenes.
A detailed comparison against existing methods was added in the dedicated Supplementary Note 5 of Supplementary Materials, and a comparison of 4D light field reconstruction versus direct reconstruction of refocused images was added in Supplementary Note 8. Both are appended below for clarification:

Supplementary Note 5. Comparison of CLIP with compressive light field photography
Existing compressive light field imaging methods are not necessarily convolutional and can recover a 4D light field ($n_a \times n_a \times N \times N$) from a 2D image ($N \times N$). We compare them with CLIP and explain the unique advantages of CLIP in using sensors of arbitrary formats for efficient light field imaging. Most compressive light field photography methods share their roots with coded aperture imaging in using a mask (transmissive or reflective) to divide the system aperture into small patches, each modulating a sub-aperture image. The resultant sensor measurement is a weighted integration of all the sub-aperture images:

$$y = \sum_{k=1}^{l} c_k I P_k \qquad \text{(Supplementary Eq. 17)}$$

where $y \in \mathbb{R}^{N^2 \times 1}$ is the vectorized sensor image, $I \in \mathbb{R}^{N^2 \times N^2}$ is the identity matrix, $c_k$ is the weighting coefficient of the $k$-th sub-aperture image $P_k$, and $l = n_a^2$ is the number of sub-apertures. … coefficients and relied on the sparsity prior for a compressive reconstruction of a 4D light field. Ashok et al. further showed that one can use a similar coding scheme for each microlens in an unfocused light field camera and recover the spatial image on the microlens with a sub-Nyquist measurement dataset, thereby addressing the angular-spatial resolution tradeoff in unfocused light field cameras. Nevertheless, multiple measurements are still needed in Ashok's and Babacan's methods for recovering a light field. Marwah et al. 3 generalized the mask position to anywhere between the aperture and the sensor. When the mask is positioned close to the sensor, different sub-aperture images are modulated with sheared (and thus incoherent) mask codes before being integrated by the sensor:

$$y = \sum_{k=1}^{l} M_k P_k \qquad \text{(Supplementary Eq. 18)}$$

where $M_k \in \mathbb{R}^{N^2 \times N^2}$ is the block diagonal matrix containing the sheared mask code. One key improvement of Marwah's work lies in the modulation of each sub-aperture image $P_k$ with a random code $M_k$ rather than the scaled identity $c_k I$ as in Supplementary Eq. 17, thereby improving the conditioning of the inverse problem, as the codes $M_k$ are incoherent with respect to each other.
Coupled with a dictionary learning process that better sparsifies a 4D light field, Marwah's approach can recover a full 4D light field from a single measurement, eliminating the need to change the mask codes.
The diffuser-camera-based light field imaging 4,5 differs from the above approaches in being convolutional: each sub-aperture image is convolved with a random nonlocal point-spread function (PSF) before integration:

$$y = \sum_{k=1}^{l} T_k P_k \qquad \text{(Supplementary Eq. 19)}$$

with $T_k \in \mathbb{R}^{N^2 \times N^2}$ being the Toeplitz convolution matrix for the random PSF in the $k$-th angular view. Light field imaging based on a diffuser camera can be implemented both with a lens 8 and in a lensless manner 7. When used with a lens, the PSF for each sub-aperture image is more compactly supported, leading to an efficient utilization of the sensor pixels (owing to smaller border effects). In contrast, the lensless approach features system simplicity and is free from lens aberrations.
It is now clear that the differentiating factor among existing compressive light field imaging methods is the matrix operating on each sub-aperture image. The matrices ($c_k I$, $M_k$) in Ashok, Babacan, and Marwah et al. are all diagonal. As a result, the sensor resolution directly determines the spatial resolution of the recovered light field (both $y$ and $P_k$ are in $\mathbb{R}^{N^2 \times 1}$), making these methods ill-suited for 0D, 1D, and sparse 2D sensors. In contrast, the Toeplitz matrix $T_k$ in diffuser-camera-based light field imaging is non-diagonal, and its row vectors multiplex multiple elements of $P_k$ into one measurement in $y$ (owing to a nonlocal PSF). Though not yet demonstrated, this allows in theory the recovery of a 4D light field from a sub-Nyquist measurement dataset (that is, $y \in \mathbb{R}^{m \times 1}$ with $m < N^2$ while $P_k \in \mathbb{R}^{N^2 \times 1}$).
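The contrast between diagonal and nonlocal matrices can be illustrated with a toy example. The sketch below is not taken from the manuscript: the sizes, the random mask values, and the use of a circulant matrix (a special case of a Toeplitz matrix) for the PSF are all assumptions made for illustration. It counts how many image pixels are never touched once only a subset of sensor readings is kept.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pix = 16   # N^2 pixels in one vectorized sub-aperture image (toy size)
m = 8        # number of sensor readings kept, m < N^2

# Diagonal modulation (mask-type models): reading i only sees pixel i.
mask_matrix = np.diag(rng.uniform(0.1, 1.0, n_pix))

# Convolutional model with a random nonlocal PSF. A circulant matrix is used
# here as a simple stand-in for the Toeplitz convolution matrix (assumption).
psf = rng.uniform(0.1, 1.0, n_pix)
conv_matrix = np.stack([np.roll(psf, i) for i in range(n_pix)])


def unobserved_pixels(matrix, rows):
    """Count image pixels that no retained measurement row touches."""
    kept = matrix[rows]
    return int(np.sum(~np.any(kept != 0, axis=0)))


rows = np.arange(m)  # keep only the first m sensor readings
print("diagonal model, unobserved pixels:", unobserved_pixels(mask_matrix, rows))
print("nonlocal PSF,   unobserved pixels:", unobserved_pixels(conv_matrix, rows))
```

With a diagonal matrix every discarded reading removes one pixel from the measurement entirely, whereas each row of the nonlocal matrix mixes all pixels, which is the property that makes sub-Nyquist sensors usable in principle.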
In contrast, CLIP is a systematic method for designing and transforming any imaging method with nonlocal data acquisition into a highly efficient light field imaging approach. For a given imaging model with measurement matrix $A$, the transformation of CLIP is achieved by splitting the measurements into different angular views, as illustrated below:

$$y = A h = \begin{bmatrix} a_1 \\ \vdots \\ a_m \end{bmatrix} h \;\longrightarrow\; y = \begin{bmatrix} A_1 & & \\ & \ddots & \\ & & A_l \end{bmatrix} \begin{bmatrix} P_1 \\ \vdots \\ P_l \end{bmatrix} = A' P \qquad \text{(Supplementary Eq. 20)}$$

where $a_i$ is a row vector, and $h$ (an image from a single angular view) is extended to a 4D light field ($P_1$ to $P_l$) with $l = n_a^2$ views (sub-apertures). While the imaging model becomes block diagonal, recovering the light field is equivalent to solving for each sub-aperture image $P_k$ with a corresponding sub-measurement matrix $A_k$. We can further exploit the correlations (redundancy) in the 4D light field by solving Supplementary Eq. 20 with appropriate sparsity-based regularizations, as used in compressive light field imaging methods [3][4][5]. It is noteworthy that the elemental matrix $A_k$ is not diagonal, unlike the matrices ($c_k I$ or $M_k$) of the mask-based methods, a key fact that enables CLIP to use 0D or 1D sensors for light field imaging. We demonstrated 4D light field recovery using CLIP in Supplementary Note 7.
The second key differentiating factor of CLIP is the explicit modeling of the correlations among sub-aperture images as $P_k = B_k h$ via light field propagation, with $B_k$ the shearing operator for the $k$-th view, assuming a uniform angular intensity distribution as derived in Supplementary Note 1. This simplifies Supplementary Eq. 20 to the CLIP equation 3 in the main text:

$$y = \begin{bmatrix} A_1 B_1 \\ \vdots \\ A_l B_l \end{bmatrix} h = A'' h$$

This step has the advantage of enabling more complicated images to be recovered without the need to find or learn a better sparsifying basis for the 4D light field, which is an important step in Marwah's work. We show this advantage in Supplementary Note 8.
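A small numerical sketch may help make the stacked model concrete. Everything below is an illustrative assumption rather than the authors' implementation: the sub-measurement matrices are random Gaussian, the shearing operator is simplified to an integer circular shift proportional to the view index, and recovery uses plain least squares at SR = 1 instead of the regularized solver described in Methods.

```python
import numpy as np

rng = np.random.default_rng(1)
N, n_a = 8, 2            # image side length and angular samples per axis (toy sizes)
l = n_a ** 2             # number of sub-apertures
m_k = N * N // l         # readings per sub-aperture, so m = N^2 overall (SR = 1)
s = 1                    # shear in pixels per angular step (hypothetical value)


def shear_matrix(u, v, s, N):
    """Permutation matrix that circularly shifts an N x N image by (s*u, s*v)."""
    idx = np.arange(N * N).reshape(N, N)
    shifted = np.roll(idx, shift=(s * u, s * v), axis=(0, 1))
    B = np.zeros((N * N, N * N))
    B[np.arange(N * N), shifted.ravel()] = 1.0
    return B


h = rng.random(N * N)    # ground-truth refocused image, vectorized
views = [(u, v) for u in range(n_a) for v in range(n_a)]
A_blocks = [rng.standard_normal((m_k, N * N)) for _ in views]  # nonlocal rows
B_blocks = [shear_matrix(u, v, s, N) for u, v in views]        # simplified shear

# Each sub-aperture sees a sheared copy of h and contributes its own readings.
y = np.concatenate([A @ (B @ h) for A, B in zip(A_blocks, B_blocks)])

# Stacked operator acting directly on the refocused image.
A_clip = np.vstack([A @ B for A, B in zip(A_blocks, B_blocks)])

h_hat, *_ = np.linalg.lstsq(A_clip, y, rcond=None)
print("max abs reconstruction error:", np.max(np.abs(h_hat - h)))
```

The point of the sketch is only the shape of the problem: the stacked operator has as many columns as the image has pixels, not as the 4D light field has samples, so the unknown is the refocused image itself.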
The computation complexity of compressive light field photography and CLIP depends on the light field resolution and the applied regularization method under the framework of regularization by denoising (see Methods). Consider $m$ measurements, $n_a \times n_a$ angular views, and $N \times N$ pixels per view. In CLIP, each iteration involves a pass of $A''$ and $A''^{T}$ along with a denoising step. The complexity for the shearing operation and the matrix $A$ is $o(n_a^2 N^2)$ and $o(m N^2)$ respectively, leading to a total complexity of $o((n_a^2 + m) N^2)$ for both $A''$ and $A''^{T}$. The complexity of BM3D and TV denoising for regularization is directly related to the image size as $o(k N^2)$, with $k$ being a denoiser-dependent constant. Therefore, the total complexity of CLIP image recovery is $o((2m + 2n_a^2 + k) N^2)$ per iteration. In comparison, while the complexity for $A'$ and $A'^{T}$ in Supplementary Eq. 20 for retrieving the 4D light field remains $o(m N^2)$ owing to the block diagonal structure, the denoising complexity of a 4D light field becomes $o(k n_a^2 N^2)$, resulting in a total complexity of $o((2m + k n_a^2) N^2)$. Similarly, we can analyze the computation complexity per iteration for compressive light field imaging methods based on the models in Supplementary Eq. 17 to 19. Supplementary Table 1 summarizes the characteristics of CLIP and compressive light field photography. It is worth noting that the computation complexity of Marwah's work does not account for the dictionary learning process, and the regularization is applied on the entire light field. Also, the convolution model of the diffuser camera is accelerated by FFT.
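To get a feel for the two per-iteration counts, the sketch below evaluates them for one parameter set. It assumes the counts read as $(2m + 2n_a^2 + k)N^2$ for direct recovery of a refocused image and $(2m + k n_a^2)N^2$ for 4D light field recovery; the function names and the value of the denoiser constant `k` are hypothetical, and constant factors hidden by the order notation are ignored.

```python
def clip_direct_cost(m, n_a, N, k):
    """Operation count per iteration for direct refocused-image recovery."""
    return (2 * m + 2 * n_a**2 + k) * N**2


def clip_4d_cost(m, n_a, N, k):
    """Operation count per iteration for 4D light field recovery."""
    return (2 * m + k * n_a**2) * N**2


# Example sizes: 8 x 8 views of 128 x 128 pixels and 64 x 128 measurements.
# k = 100 is a placeholder for the denoiser-dependent constant (assumption).
m, n_a, N, k = 64 * 128, 8, 128, 100
direct = clip_direct_cost(m, n_a, N, k)
full_4d = clip_4d_cost(m, n_a, N, k)
print(f"direct: {direct:.3e}, 4D: {full_4d:.3e}, ratio: {full_4d / direct:.2f}")
```

Under these assumed counts the gap between the two routes is driven by the denoising term, which scales with the number of views only in the 4D case.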

Supplementary Note 8. CLIP 4D light field reconstruction versus direct reconstruction
While CLIP can recover a 4D light field as demonstrated in the previous note, we show here that directly recovering a refocused image can better accommodate complex scenes, particularly for imaging with lower-dimensional (1D or 0D) sensors. Marwah's work relied on a dictionary learning process to obtain a representation basis that better sparsifies the 4D light field, thereby attaining excellent 4D light field reconstruction for complex scenes. On the other hand, Antipa 4 pointed out that improper regularization of the 4D light field in a diffuser-based camera can degrade (or even destroy) the angular information in the light field.
In contrast, CLIP does not rely on a high-quality 4D light field reconstruction to obtain excellent refocused images: CLIP's complementary measurements among sub-apertures can significantly improve the refocused images even though the recovered 4D light field may not be of high quality, which is the case under the compressive regime. Further, CLIP can directly recover a refocused image, like coded-aperture and wavefront-coding methods, to accommodate complex scenes better, as explained in the previous section. We demonstrate this via a synthetic study for synthetic scene 2 and an experimentally acquired light field from the 'letter scene', using a sampling ratio of SR=1. During the reconstruction of the 4D light field, the regularization parameter is tuned to obtain the best refocused image from the light field data. Supplementary Figure 11 shows the recovered 4D light field and refocused images for the two scenes under the CLIP-1D (a and b) and CLIP-0D (c and d) implementations, with the NMSE listed in Supplementary Table 4. It is noted that while the light field suffers from significant background signals and noise, the refocusing process coherently assembles CLIP's complementary imaging across the sub-apertures to yield a substantially better refocused image. Moreover, CLIP's direct reconstruction further improves the quality of the refocused image by rendering more image details and a higher contrast.
Supplementary Figure 11. 4D light field reconstruction versus direct reconstruction of refocused images by CLIP. a-b, CLIP-1D reconstruction for the synthetic scene and the experimental 'letter' scene. c-d, CLIP-0D reconstruction for the two scenes. The sampling ratio of CLIP is fixed at SR=1.

3. The paper uses algorithm speed as a major factor to motivate the work, but doesn't do a lot to quantify or verify that statement.
The authors mention runtimes of methods in some places, but often the runtimes include a bulk time for different methods like line of sight and NLOS reconstruction. There is also no comparison of computational or memory complexity. I think the paper needs to include some meaningful comparison to alternative methods and a discussion of the expected improvements in performance over prior methods. It also needs to provide complete and structured information about the actual execution speeds and put those in some meaningful context.

Response:
We thank the reviewer for this suggestion. We analysed and compared the computation complexity per iteration of CLIP reconstruction with that of compressive light field imaging methods in Supplementary Note 5.
Regarding NLOS imaging with the CLIP camera, the reconstruction has two parts: CLIP reconstruction of the (x, y, t) data cube, followed by the hybrid time-frequency domain algorithm that recovers the 3D hidden scene. CLIP can accelerate (x, y, t) data acquisition (down to a single shot), but the iterative reconstruction is not fast enough for real-time imaging. To solve this problem, we show in Supplementary Note 13 (originally Supp. Note 9) that a fast 'adjoint reconstruction' of CLIP can be used for NLOS imaging at the expense of degraded robustness against noise. We also compared the proposed hybrid time-frequency domain algorithm with alternative methods in terms of computation and memory complexity in the revised Methods Section of the main text, along with the execution time that includes CLIP reconstructions of the (x, y, t) data cube. The comparisons are excerpted below to clarify these points.
… For a 128×128×128 imaging volume with a spatiotemporal data cube of 125×125×1016, the NLOS reconstruction time is ~0.03 seconds, which can reach a 30 Hz video rate. The actual bottleneck lies in the iterative CLIP reconstruction of the spatiotemporal data cube on the wall, which takes about 2.0 seconds. However, we show in Supplementary Note 9 that a fast CLIP solution via the adjoint operator can reduce the reconstruction time to 0.01 seconds for NLOS imaging at the expense of noise robustness. Table 1 summarizes the computation and memory complexity of the hybrid time-frequency domain reconstruction method against the time-domain phasor field method and f-k migration for NLOS imaging with curved surfaces.
It is noteworthy that the complexity of f-k migration includes the necessary preprocessing step for coping with curved surfaces, and its execution time is obtained by CPU processing with a downsampled spatiotemporal data cube of (32×32×512) instead of (125×125×1016). Conforming with the complexity analysis, the preprocessing step is more time consuming than the actual reconstruction in f-k migration.

4. Similarly, the paper talks about the robustness of the work to missing or erroneous pixels, but does not actually do anything to test that.

Response:
We thank the reviewer for raising this point. The robustness against missing pixels was demonstrated to some extent by using a sparse 2D detector for light field imaging in the original Supplementary Figure 4 of Supplementary Materials. We added both experimental and synthetic results in the dedicated Supplementary Note 9 of the revised Supplementary Materials to further demonstrate and quantify its robustness against erroneous or missing pixels, which is referred to in the revised main text as " … endows CLIP with imaging robustness against defective pixels or scene occlusions. Because the complete scene is encoded in any subset of the measurements, image recovery is not substantially affected by a fraction of defective pixel readings, despite that the conditioning of image reconstruction might deteriorate (Supplementary Note 9)… " Supplementary Note 9 is appended below to clarify this point.

" Supplementary Note 9. CLIP robustness
The robustness of CLIP against missing pixels is demonstrated to some extent by imaging with sparse 2D detectors in Supplementary Note 5. Here, we further test the robustness of CLIP against erroneous measurements. Two typical errors are dead (or missing) pixels and saturated sensor readings. We tested the case in which the measurement contains both types of errors by first normalizing the measurement data and then randomly setting part of the measurement to 0 (dead) or 1 (saturated). The error induced by defective measurements is evaluated by NMSE for both the raw measurement data and the reconstructed images. Fixing the sampling ratio SR at 1, we varied the percentage of erroneous measurements from 0.1% to 1% for the experimental data in CLIP-0D, and from 1% to 10% for the synthetic data in CLIP-1D. Supplementary Figure 12 shows the CLIP imaging results, and the corresponding NMSEs are summarized in Supplementary Table 5. Owing to the nonlocal data acquisition strategy and the regularization step, the reconstructed image error in CLIP is substantially smaller than the error in the raw measurements, making it more robust than classic imaging methods.
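The corruption procedure described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: min-max scaling is assumed for the normalization step, NMSE is assumed to be the squared error divided by the squared norm of the reference, and the function names `nmse` and `corrupt` are hypothetical.

```python
import numpy as np


def nmse(estimate, reference):
    """Normalized mean square error, assumed as ||e - r||^2 / ||r||^2."""
    estimate = np.asarray(estimate, dtype=float)
    reference = np.asarray(reference, dtype=float)
    return float(np.sum((estimate - reference) ** 2) / np.sum(reference ** 2))


def corrupt(measurement, fraction, rng):
    """Scale readings to [0, 1], then set a random fraction to 0 (dead) or 1 (saturated)."""
    y = np.asarray(measurement, dtype=float)
    y = (y - y.min()) / (y.max() - y.min())      # min-max normalization (assumption)
    n_bad = int(round(fraction * y.size))
    bad = rng.choice(y.size, size=n_bad, replace=False)
    corrupted = y.copy()
    corrupted.flat[bad] = rng.integers(0, 2, size=n_bad)  # 0 = dead, 1 = saturated
    return y, corrupted, bad


rng = np.random.default_rng(0)
y_clean, y_bad, bad = corrupt(rng.random(1000), fraction=0.05, rng=rng)
print("raw-measurement NMSE:", nmse(y_bad, y_clean))
```

The same `nmse` function would then be applied to the image reconstructed from `y_bad` to compare the error in the image with the error in the raw measurements.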

…"
5. I'm a little confused about the actual setup used in the different demonstrations. The imaging setup with the streak camera and lenslet array should result in a light field array with a diameter similar to the slit of the streak camera. So a centimeter or two. An array of that size should not be big enough to image around the occlusions they create. The authors explain the geometry that determines the permissible size of the occluder in the supplement. I think some added clarification and maybe a sketch of the setup is needed here.

Response:
In principle, it is the relative scale between the camera baseline and the object scene that matters for imaging through occlusions. We added a photograph of the setup of the proof-of-concept experiments in Supplementary Figure 13b, and gave numerical details in the revised Supplementary Note 10 to clarify this point, which reads "…A photograph of the system setup for the dynamic imaging experiment is shown at the bottom of Supplementary Figure 13b, where the camera baseline L is ~15 mm, and the occluder was placed at approximately d=50 mm (or ~40 mm in the static studies) from the lenslet array. For an occluder with a width of ≈ 6 (or 10) mm, the inactive region is hence ≈ 33 (80) mm. The object was positioned at a distance of ~70 mm (or >90 mm for different static studies) from the occluder to avoid falling into the inactive region… "

… paper about the FK migration algorithm that the authors use also describes reconstruction from non-planar surfaces.
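As a plausibility check on the occlusion geometry quoted in the response to Comment 5 above, the inactive-region figures can be reproduced with a similar-triangles estimate. The formula below is an assumption introduced here (it is not quoted from Supplementary Note 10): for a centered occluder of width w at distance d from a baseline of length L, the fully shadowed zone extends w·d/(L − w) behind the occluder. The function name is hypothetical.

```python
def inactive_depth(baseline_mm, occluder_width_mm, occluder_distance_mm):
    """Depth behind the occluder that is hidden from every view (similar triangles)."""
    if occluder_width_mm >= baseline_mm:
        # An occluder at least as wide as the baseline casts a shadow that never closes.
        return float("inf")
    return occluder_width_mm * occluder_distance_mm / (baseline_mm - occluder_width_mm)


print(round(inactive_depth(15, 6, 50), 1))    # dynamic experiment: 33.3
print(round(inactive_depth(15, 10, 40), 1))   # static studies: 80.0
```

Both values agree with the quoted ≈ 33 mm and 80 mm, and the stated object distances (~70 mm and >90 mm from the occluder) lie beyond them.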

Response:
We agree that NLOS imaging with a dynamic non-planar relay wall has been demonstrated by La Manna et al., who scanned a collimated laser beam rather than the SPAD detector for 2D recording of the time-of-flight data. The reception point was fixed at a stationary point, sidestepping the depth-of-field problem of the detection optics. When array detectors are used to accelerate NLOS imaging acquisition on curved surfaces, La Manna's method will suffer from the depth-of-field problem as usual. In contrast, CLIP can use a 1D array detector for fast imaging with curved surfaces.
The f-k migration algorithm can indeed be adapted for NLOS imaging with curved surfaces, but its confocal imaging process still suffers from a long acquisition time. Moreover, as compared in Table 1 of the revised Methods Section (see Response to Comment 3), its preprocessing step to cope with non-planar surfaces is more time-consuming than the actual f-k migration step or the time-domain phasor field method.
Regarding this, we added a discussion on these two points in the section "NLOS imaging with curved and disconnected surfaces", which reads "… The ToF-CLIP camera addresses this critical need for real-time mapping of the relay surface via built-in flash LiDAR imaging. More importantly, it can accommodate a non-planar surface geometry for NLOS imaging using array detectors with its light field capability. Paired with a proposed hybrid time-frequency domain reconstruction algorithm, which can handle general surfaces with a computational complexity of $o(N^4)$ (Methods), ToF-CLIP can attain real-time NLOS imaging with arbitrary curved surfaces. While NLOS imaging with a dynamic and curved surface has been demonstrated by Manna et al. 10, its reception point was fixed at a stationary point rather than being on the dynamic surface, making it inapplicable for real-time imaging with array detectors. Similarly, the preprocessing step 11 proposed by Lindell et al. that adapts the f-k migration reconstruction algorithm to deal with slightly curved surfaces in confocal NLOS imaging has a computational complexity of $o(N^5 \log N)$, which is higher than that of the time-domain phasor field method and thus inefficient for real-time reconstruction…"

Response:
We agree with the reviewer that FK migration needs to interpolate from a spherical coordinate system onto a Cartesian one in the Fourier domain, which causes high memory usage.
We clarify that the frequency-domain NLOS reconstruction method that we used was the fast frequency-domain phasor-field method proposed by Liu et al., which consumes much less memory.
The interpolation in the time domain corrects for the perspective distortion of the recording camera on non-planar surfaces, thereby yielding a regular 2D grid sampling pattern on the virtual plane to facilitate subsequent frequency-domain NLOS reconstruction (otherwise, a nonuniform-FFT-based NLOS reconstruction algorithm would need to be developed). This interpolation is not needed in the proposed hybrid time-frequency domain method because the waves can be directly migrated in the time domain to a regular 2D grid on the virtual plane.
Regarding this, we clarified in the revised Methods section that the second part of the hybrid time-frequency domain reconstruction is the frequency-domain phasor-field method, which reads " … The hybrid frequency-time domain reconstruction method proposed here first converts the spatiotemporal measurement on a curved surface onto a virtual plane via wave propagation in time domain and then reconstructs the hidden scenes with the existing efficient frequency-domain phasor field method 9 …"

Reviewer 2
The manuscript reports on a method for light field photography in which, in essence, instead of acquiring L different images from different view points, only one pixel or one line of each image is acquired but each one from a different perspective. These are then combined together through a minimisation approach that relies on a "shear" operator that models how the various parts of the scene are captured at varying view points and then registered in a single final image.
The idea is clever and seems to deliver very promising results. The authors show many different possible implementations of the technique, ranging from 2D imaging, flash lidar to non line of sight imaging. I am not personally convinced that the NLOS imaging results are that significant compared to the state of the art. However, the other results do look convincing, including the measurements in the presence of occluders. The video material provided is also very convincing.
The work is very carefully prepared with sufficient details to reproduce the results. My only comment is that the supplementary information is actually very much integral to the main work as many or most of the actual results are presented there. This is probably a choice based on the fact that the authors present so many different cases that it is hard to show all results in the main text. But this is just a stylistic choice and does not impact the importance of the work itself.
I therefore suggest acceptance of this work for publication without any need for revision.

Response:
We appreciate the reviewer's positive comments on our work. Regarding the NLOS imaging results, the quality is not yet state-of-the-art because the scene is imaged with a small number of time-of-flight sensors (a 1D sensor in our demonstration) in a snapshot, scanless manner (<0.1 s), which leads to a high compression factor (~20) for NLOS imaging.

Reviewer 3
In this manuscript, the authors report their development of an imaging method which they call "compact light field photography (CLIP)". They claim that CLIP enables three-dimensional imaging with fewer detectors compared to conventional light field photography methods. They demonstrated volumetric imaging by combining CLIP with other imaging techniques, such as time-of-flight, LiDAR, and non-line-of-sight 3D imaging. The main argument of this work is that the data size can be reduced compared to conventional light field photography, which is advantageous for large-scale, high-dimensional photography. However, I do not think the quality of the manuscript meets the publication criteria of Nature Communications in terms of novelty, quality of presentation, and impact of the results. Detailed comments are listed below:

Response:
We appreciate the reviewer's extensive and constructive comments on our work. Extensive revisions have been made to the manuscript and supplementary materials accordingly to address the raised points, as detailed below.

… arranged sensors (a single pixel, a linear array, or a sparse 2D area detector). Whether this compression works for retrieval of 3D images depends on the scene (as long as the restricted isometry property of the measurement matrix is not evaluated). The authors' approach seems to inherently lack generality.

Response:
We thank the reviewer for raising the important point on evaluating the RIP (restricted isometry property) of the measurement matrix when working in the compressive regime. We clarified that, while a major appeal of CLIP is to use a limited sensor budget to acquire large-scale light fields, it is not necessarily confined to the compressive regime when directly solving for a refocused image. Also, CLIP can well accommodate, but is not limited to, those special sensor formats. When working in the compressive regime, we show in the revised manuscript that CLIP is general enough for recovering structured-sparse signals such as natural images.
As detailed in the revised Supplementary Note 5 (please see Response to Comment 3 of Reviewer 1), which articulates the difference between CLIP and compressive light field imaging methods, CLIP can transform any imaging model $y = Ax$ ($A \in \mathbb{R}^{m \times N^2}$) that employs nonlocal data acquisition into a light field imaging method. The resultant CLIP equation $y = A'x$ has the same dimensions as the measurement matrix $A$ (that is, $A' \in \mathbb{R}^{m \times N^2}$). As a result, it is not necessarily under-determined and works equally well for imaging methods using dense 2D sensor arrays. It is proved in Supplementary Note 3 that CLIP can include coded-aperture and wavefront-coding based light field imaging methods as special cases. The motivation for using 0D and 1D sensors is that they are far more accessible for imaging at the ultrafast time scale or in the infrared/terahertz spectral band, for which existing compressive light field imaging methods are ill-suited, as proved in Supplementary Notes 3 and 5. Also, the mathematical models of CLIP with 0D and 1D sensors are transformed from the imaging models of the single pixel camera and x-ray computed tomography respectively, which have been demonstrated to be general for practical imaging applications when working in the compressive regime.
Mathematically, the generality of CLIP in the compressive regime can be evaluated by computing the RIP constant of the measurement matrix A' as mentioned by the reviewer. However, RIP is only a sufficient condition, and evaluating RIP is an NP-hard problem. We added extensive numerical tests to demonstrate the generality of CLIP under the structured-sparse signal model in the dedicated Supplementary Note 6 of Supplementary Materials, and stressed in the revised Methods section that CLIP has the generality to recover structured-sparse signals such as natural images, which reads " … It is worth noting that while recovering the 4D light field is always compressive in CLIP, directly retrieving a refocused image is not necessarily the same. Still, a major appeal of CLIP is to use a small number of sensors for recording a large-scale light field, which typically falls into the compressive sampling regime. In this case, we show in Supplementary Note 6 that while the imaging model designed in or transformed by CLIP may not satisfy the restricted isometry property (RIP) to guarantee uniform recovery of arbitrary images in the classic sparse signal model, CLIP has the generality in the structured-sparse signal model and hence remains applicable for practical imaging applications." The added Supplementary Note 6 is appended below for a detailed explanation.
"Supplementary Note 6. Generality of CLIP
While recovering a 4D light field is always under-determined in CLIP and compressive light field photography methods, directly recovering a refocused image by CLIP is not necessarily the same. As a result, CLIP is not bound to the compressive regime, though one of its major appeals is to record a large-scale light field with a highly limited sensor budget. When working in the compressive regime, it is important to evaluate whether the system matrix $A'$ of CLIP supports a uniform recovery of arbitrary s-sparse vectors (vectors with at most s non-zero entries) in the classic sparse signal model by computing the restricted isometry property (RIP) of the matrix $A'$. However, RIP is not a necessary condition, and computing the RIP constant is an NP-hard problem. Up to now, only limited types of matrices have been proven to satisfy RIP with an exponentially high probability. On the other hand, it was shown 12 that RIP is absent in a range of practical compressive imaging applications, and yet experimental image recovery is excellent. These applications include compressive x-ray tomography, MRI, and single pixel cameras. The work of Bastounis 12 and Roman 13, among other similar works 14, attributed the correct recovery of image x to the structured sparsity of x (that is, the sparsity of x has a structure instead of exhibiting an arbitrary pattern) and, together with an extended concept of RIP in levels, explained the success of these compressive imaging methods in practice even though their measurement matrices fail to satisfy the classic RIP. As natural images are highly structured, and CLIP with 0D and 1D sensors is transformed from the single pixel camera and x-ray tomography methods respectively, it is expected that CLIP can attain similar imaging performance in practice.
We followed the philosophy of the generalized flip test proposed by Roman et al. 13 to evaluate the general applicability of CLIP under the structured-sparsity signal model. The idea of the test is to evaluate the reconstruction quality of different images with the same sparsity. To generate such images, we applied shift, flip, and rotation operations to parts of the image, and evaluated the reconstruction error using the normalized mean square error (NMSE). As CLIP deals with light field data, these operations should be applied to 3D objects. To this end, the 3D scenes were modelled in Blender software for rendering the 4D light field data on a regular 2D grid.
Throughout the manuscript, synthetic CLIP measurements with 1D and 0D sensors were obtained as follows. In CLIP-0D, each sub-aperture image is encoded with random binary codes to yield $m_k = m/l$ single-pixel readings. For CLIP-1D, the measurements are obtained in three steps: a) generate $m/N$ projection angles α uniformly in the range of [0°, 180°]; b) randomly permute the angles α and distribute them evenly among the $l$ sub-apertures; c) calculate for each sub-aperture image the projection data along the assigned angles. The sampling ratio (SR) is defined as the quotient between the total number of measurements $m$ and the image size $N^2$ (rather than the size of the 4D light field). For this test, we fixed SR at 0.5.
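The measurement-generation steps above can be sketched as follows. This is an illustration only: the light field is random placeholder data rather than a Blender rendering, "uniformly" is interpreted as evenly spaced angles with the 180° endpoint excluded, binary codes are drawn as 0/1 values, and step c) (computing the projections themselves) is omitted. The function name `clip0d_measure` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n_a = 128, 8
l = n_a ** 2                         # 64 sub-apertures
SR = 0.5
m = int(SR * N * N)                  # total number of measurements

# CLIP-1D, steps a) and b): m/N angles, randomly permuted and split evenly
# among the l sub-apertures (even spacing is an assumption).
n_angles = m // N
angles = np.linspace(0.0, 180.0, n_angles, endpoint=False)
angles_per_view = np.array_split(rng.permutation(angles), l)

# CLIP-0D: m_k = m / l random binary codes per sub-aperture image.
m_k = m // l


def clip0d_measure(sub_aperture_images, m_k, rng):
    """Encode each N x N sub-aperture image with m_k random 0/1 codes."""
    readings = []
    for image in sub_aperture_images:
        codes = rng.integers(0, 2, size=(m_k, image.size), dtype=np.int8)
        readings.append(codes @ image.ravel())
    return np.concatenate(readings)


light_field = rng.random((l, N, N))  # placeholder 8 x 8 x 128 x 128 light field
y = clip0d_measure(light_field, m_k, rng)
print("measurements:", y.size, "data reduction:", light_field.size // y.size)
```

With these sizes the measurement count equals 64×128 and the ratio of light field samples to measurements is 128, matching the sizes quoted for the generalized flip test.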
Supplementary Figure 6. Generalized flip test of CLIP reconstruction for synthetic scene 1 with SR = 0.5. The ground truth light field size is 8×8×128×128, and the measurement data size is 64×128, leading to a 128-fold data reduction.

… Fig. S5d, the angles and positions of cylindrical lenses, the widths of slits, and the distance between the lenses and the sensor critically change the intensity profile on the sensor. In addition, aberration due to the imperfection of the lenses induces loss of information. As long as one can acquire the entire image, it should be taken. If compression is needed, we can do it by post processing using an FPGA in a high-throughput, lossless, and reproducible manner.

We agree with the reviewer that when a suitable 2D sensor is available for a target application, acquiring the image at or above the Nyquist rate and then compressing it in post processing will be advantageous in terms of fidelity. However, this is not always feasible, and the motivation for compressive sampling is mainly twofold.
First and foremost is the availability (and economy) of suitable detectors for acquiring the signal of interest. Currently, there are no ultrafast detectors comparable to consumer-grade 2D CMOS or CCD image sensors for snapshot acquisition of large-scale time-of-flight data or any similarly high-dimensional data such as hyperspectral images. Existing ultrafast cameras are in the format of a single pixel (SPAD, PMT, etc.), a linear array (streak camera or linear array SPAD), or a sparse 2D array (state-of-the-art SPAD arrays have a relatively low fill factor below 50%). As a result, a slow scanning (spatial and/or temporal) mechanism is needed for 2D time-of-flight (or hyperspectral) imaging. Other applications for which 0D and 1D sensors are more accessible include imaging in the infrared and terahertz regions, where the detector resolution remains low. It is the limited sensor budget that hampers light field imaging in these applications.
The second motivation is the compressibility of natural signals, especially high-dimensional signals. As pointed out by the reviewer, signal compression is typically done by post-processing after a full acquisition. However, for applications that suffer from detector availability issues but deal with highly compressible signals, compression in the sampling phase can become advantageous because far fewer sensor measurements are needed. Indeed, the compressibility of natural images has been well exploited in many imaging applications. For example, x-ray CT and MRI have adopted compressive sampling to reduce the radiation dose and imaging time. The compressibility of 4D light fields or natural images is also a key ingredient in existing compressive light field imaging methods.
The factors affecting optical compression are accounted for by a system calibration step, as typically done in other computational imaging methods. In theory, the nonlocal sampling and structured-sparse signal recovery strategy of CLIP could be more robust against information loss and lens aberration of the optical system. The 0D implementation does not need a lens, as demonstrated in single pixel cameras, and CLIP with a 1D sensor tends to suffer from less aberration than conventional imaging because there is no optical power along the invariant axis of the cylindrical lens. The setup in Fig. S5d differs from a coded aperture camera only in replacing the spherical lens system with a cylindrical one, which modifies the ideal point spread function from a point into an angled line. The effects of the other factors on the signal intensity on the sensor (the positions of cylindrical lenses, the widths of slits, and the distance between the lenses and the sensor) remain the same as in coded aperture cameras. For example, a change in the distance between the lens and the sensor will defocus the image signal (with the circular blur being replaced by an elliptical one). The width of the slits defines the encoding resolution. The lens position relative to the sensor determines the imaging field of view. All these factors (except aberrations) are obtained after system alignment and calibration.
Regarding this, we clarified the calibration step for Fig. S5d in the revised Supplementary Note 5, which reads " … It is noted that the implementation for randomly coded line-shape PSF is very similar to the coded-aperture camera, with the camera lens and image sensor being replaced by a cylindrical lens and a 1D sensor respectively. Like coded-aperture imaging, therefore, a one-time calibration step for the camera will be needed to retrieve the PSF on the sensor by imaging a point source and scanning the 1D sensor along the other dimension …" Also, we included a dedicated Supplementary Note 9 on evaluating the robustness of CLIP under information loss (in the form of missing and erroneous measurements). Please see Response to Comment 4 of Reviewer 1 for more details.

The definition of CLIP is unclear. The authors' statement "To address these challenges, we present compact light field photography (CLIP) to sample dense light fields with a drastically improved efficiency and flexibility. By employing nonlocal image acquisitions and distributing a complete acquisition process into different views, CLIP enables light field imaging with a measurement dataset smaller than a single sub-aperture image and remains natively applicable to camera array systems" sounds no more than compressed light field photography, which has been thoroughly investigated.

Response:
Because both Comment 3 and 4 concerned the distinction between CLIP and existing compressive light field photography methods, we address them together in our Response to Comment 4.

Response:
We agree that pixel binning or extraction can reduce the measurement easily and effectively, but this will equally reduce the imaging (or light field) resolution, and require the intended applications to have a dense 2D CCD/CMOS camera to begin with. In contrast, CLIP can use a sensor of a limited resolution, such as 0D or 1D sensors, for efficient light field imaging by transforming an appropriate imaging model that employs nonlocal data acquisition. As examples, CLIP transformed the imaging model of the single pixel camera and x-ray CT for efficient light field imaging with a single pixel and a 1D sensor respectively in the manuscript.
Moreover, we demonstrated experimentally in the revised manuscript that CLIP can recover a 4D light field or directly reconstruct a refocused image from the same measurement data (see Response to Comment 5 below for details). While existing compressive light field imaging methods recover a 4D light field from a densely sampled 2D image, we showed in Supplementary Note 8 that CLIP's approach of directly reconstructing a refocused image has the advantage of coping with complex scenes better. A detailed comparison against existing compressive light field imaging methods was added in the dedicated Supplementary Note 5 of Supplementary Materials, and the direct reconstruction of refocused images is compared with 4D light field reconstruction in Supplementary Note 8; both notes are appended below for clarification.

Supplementary Note 5. Comparison of CLIP with compressive light field photography
Existing compressive light field imaging methods are not necessarily convolutional and can recover a 4D light field (na×na×N×N) from a 2D image (N×N). We compare them with CLIP and explain the unique advantages of CLIP in using sensors of arbitrary formats for efficient light field imaging. Most compressive light field photography methods share their roots with coded aperture imaging in using a mask (transmissive or reflective) to divide the system aperture into small patches, each modulating a sub-aperture image. The resultant sensor measurement is a weighted integration of all the sub-aperture images:

$$ f = \sum_{k=1}^{n_a^2} w_k I P_k, \qquad \text{(Supplementary Eq. 17)} $$

where $f \in \mathbb{R}^{N^2 \times 1}$ is the vectorized sensor image, $I \in \mathbb{R}^{N^2 \times N^2}$ is the identity matrix, $w_k$ is a scalar representing the mask transmission coefficient for the k-th sub-aperture image, and $P_k \in \mathbb{R}^{N^2 \times 1}$ is the corresponding vectorized sub-aperture image. It is noted that imaging without the coding mask is equivalent to setting all the weights $w_k$ to 1. While $n_a^2$ different sets of mask coefficients (and sensor measurements $f$) are typically needed to recover the light field ($P_1$ to $P_{n_a^2}$), Ashok 7 and Babacan 8 proposed to use a smaller number $m < n_a^2$ of mask coefficient sets and relied on the sparsity prior for a compressive reconstruction of a 4D light field. Ashok et al. further showed that one can use a similar coding scheme for each microlens in an unfocused light field camera and recover the spatial image on the microlens with a sub-Nyquist measurement dataset, thereby addressing the angular-spatial resolution tradeoff in unfocused light field cameras. Nevertheless, multiple measurements are still needed in Ashok's and Babacan's methods for recovering a light field.
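The need for multiple exposures in the mask-weighted model can be seen in a small numerical sketch. The sizes, the uniform random coefficients, and the variable names below are assumptions for illustration, not values from the cited works.

```python
import numpy as np

rng = np.random.default_rng(1)
l, N = 4, 8                               # l sub-apertures, N x N pixels (toy sizes)
P = rng.random((l, N * N))                # vectorized sub-aperture images
W = rng.uniform(0.1, 1.0, size=(l, l))    # one row of mask coefficients per exposure

# Each exposure is a weighted sum of all sub-aperture images.
F = W @ P                                 # shape (l, N*N): l sensor images

# With l independent coefficient sets the light field follows from a linear solve.
P_rec = np.linalg.solve(W, F)

# With fewer exposures the system is under-determined and needs a sparsity prior.
rank_two_exposures = np.linalg.matrix_rank(W[:2])
```

Because every exposure has the full N×N sensor resolution, the mixing happens only across views and never across pixels, which is the property that ties these methods to dense 2D sensors.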
Marwah et al. 3 generalized the mask position to anywhere between the aperture and the sensor. When the mask is positioned close to the sensor, different sub-aperture images are modulated with sheared (and thus incoherent) mask codes before being integrated by the sensor:

$$ f = \sum_{k=1}^{n_a^2} M_k P_k, \qquad \text{(Supplementary Eq. 18)} $$

where $M_k \in \mathbb{R}^{N^2 \times N^2}$ is the block diagonal matrix containing the sheared mask code for the k-th sub-aperture image, and $f$ and $P_k$ are the vectorized sensor image and sub-aperture images. One key improvement of Marwah's work lies in the modulation of each sub-aperture image $P_k$ with a random code $M_k$ rather than with a scaled identity matrix as in Supplementary Eq. 17, thereby improving the conditioning of the inverse problem as the codes are incoherent with respect to each other. Coupled with a dictionary learning process that better sparsifies a 4D light field, Marwah's approach can recover a full 4D light field from a single measurement, eliminating the need to change the mask codes.
The diffuser-camera-based light field imaging 4,5 differs from the above approaches in being convolutional: each sub-aperture image is convolved with a random nonlocal point-spread function (PSF) before integration:

$$ f = \sum_{k=1}^{n_a^2} T_k P_k, \qquad \text{(Supplementary Eq. 19)} $$

with $T_k \in \mathbb{R}^{N^2 \times N^2}$ being the Toeplitz convolution matrix for the random PSF in the k-th angular view. Light field imaging based on a diffuser camera can be implemented either with a lens 8 or in a lensless manner 7 . When used with a lens, the PSF for each sub-aperture image is more compactly supported, leading to an efficient utilization of the sensor pixels owing to smaller border effects. In contrast, the lensless approach features system simplicity, and it is free from lens aberrations.
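A compact sketch of this convolutional forward model is shown below. It assumes periodic image borders, so that each convolution becomes a pointwise product in the Fourier domain (a circulant stand-in for the Toeplitz matrix); the function name `diffuser_forward` and the random PSFs are hypothetical.

```python
import numpy as np

def diffuser_forward(sub_images, psfs):
    """Sum over views of the circular convolution of each sub-aperture image with its PSF.

    sub_images, psfs: arrays of shape (l, N, N).
    """
    spec = np.fft.fft2(sub_images) * np.fft.fft2(psfs)  # convolution theorem, per view
    return np.real(np.fft.ifft2(spec)).sum(axis=0)

rng = np.random.default_rng(0)
l, N = 4, 16
sub_images = rng.random((l, N, N))
psfs = rng.random((l, N, N))          # stand-ins for random nonlocal PSFs
sensor = diffuser_forward(sub_images, psfs)
```

Each sensor pixel here collects contributions from many pixels of every view, in contrast with the pixel-wise modulation of the mask-based models.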
It is now clear that the differentiating factor among existing compressive light field imaging methods is the matrix operating on each sub-aperture image. The matrices in the methods of Ashok, Babacan, and Marwah et al. (a scaled identity matrix and a sheared mask code, respectively) are all diagonal. As a result, the sensor resolution directly determines the spatial resolution of the recovered light field (both the measurement $f$ and each sub-aperture image $P_k$ are in $\mathbb{R}^{N^2 \times 1}$), making these methods ill-suited for 0D, 1D, and sparse 2D sensors. In contrast, the Toeplitz matrix in diffuser-camera-based light field imaging is non-diagonal, and its row vectors multiplex multiple elements of $P_k$ into one measurement in $f$ (owing to a nonlocal PSF). Though not demonstrated yet, this allows in theory the recovery of a 4D light field from a sub-Nyquist measurement dataset (that is, $f \in \mathbb{R}^{m \times 1}$ with $m < N^2$ while $P_k \in \mathbb{R}^{N^2 \times 1}$).
CLIP, by comparison, is a systematic method for designing and transforming any imaging method with nonlocal data acquisition into a highly efficient light field imaging approach. For a given imaging model with measurement matrix $A$, the transformation of CLIP is achieved by splitting the measurements into different angular views, as illustrated below:

$$ f = A h = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_m \end{bmatrix} h \quad \Longrightarrow \quad f = \begin{bmatrix} A_1 & & \\ & \ddots & \\ & & A_l \end{bmatrix} \begin{bmatrix} P_1 \\ \vdots \\ P_l \end{bmatrix}, \qquad \text{(Supplementary Eq. 20)} $$

where $a_i$ is a row vector of $A$, each sub-measurement matrix $A_k$ collects the rows assigned to the k-th view, and $h$ (an image from a single angular view) is extended to a 4D light field ($P_1$ to $P_l$) with $l = n_a^2$ views (sub-apertures). While the imaging model becomes block diagonal, recovering the light field is equivalent to solving for each sub-aperture image $P_k$ with its corresponding sub-measurement matrix $A_k$. We can better exploit the correlations (redundancy) in the 4D light field by solving Supplementary Eq. 20 with appropriate sparsity-based regularizations, as used in compressive light field imaging methods [3][4][5] . It is noteworthy that the elemental matrix $A_k$ is no longer diagonal, unlike the scaled identity and sheared mask matrices of the mask-based methods, a key fact that enables CLIP to use 0D or 1D sensors for light field imaging. We demonstrated 4D light field recovery using CLIP in Supplementary Note 7.
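The splitting of a measurement matrix across views can be checked numerically with a small sketch. The Gaussian matrix, the random row assignment, and the variable names (`A_sub`, `A_block`) are assumptions for illustration; any nonlocal measurement matrix could take the place of `A`.

```python
import numpy as np

rng = np.random.default_rng(2)
l, N, m = 4, 8, 32                        # views, image side, total measurements
A = rng.standard_normal((m, N * N))       # measurement matrix of the original model
P = rng.random((l, N * N))                # sub-aperture images, one row per view

# Split the m rows of A into l disjoint groups, one per angular view.
row_groups = np.array_split(rng.permutation(m), l)
A_sub = [A[rows] for rows in row_groups]

# CLIP measurement: view k is sampled only by its own rows.
f = np.concatenate([A_k @ P_k for A_k, P_k in zip(A_sub, P)])

# The same measurement written as one block-diagonal system on the stacked light field.
A_block = np.zeros((m, l * N * N))
r = 0
for k, A_k in enumerate(A_sub):
    A_block[r:r + len(A_k), k * N * N:(k + 1) * N * N] = A_k
    r += len(A_k)
```

The total number of measurements stays at m regardless of the number of views, so adding views refines the angular sampling without enlarging the dataset.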
The second key differentiating factor of CLIP is the explicit modeling of the correlations among sub-aperture images as $P_k = B_k h$ via light field propagation, where $B_k$ is the shearing operator for the k-th view, assuming a uniform angular intensity distribution as derived in Supplementary Note 1. This simplifies Supplementary Eq. 20 to the CLIP equation 3 in the main text:

$$ f = \begin{bmatrix} A_1 B_1 \\ \vdots \\ A_l B_l \end{bmatrix} h. $$

This step has the advantage of enabling more complicated images to be recovered without the need to find or learn a better sparsifying basis for the 4D light field, which is an important step in Marwah's work. We show this advantage in Supplementary Note 8.
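A toy version of this forward model, in which every view sees a sheared copy of one image, is sketched below. The shearing is simplified to an integer pixel shift with periodic borders, and the parameter `s` (a stand-in for the refocusing depth), the view indexing, and the function names are hypothetical.

```python
import numpy as np

def shear(h, u, v, s):
    """Toy shearing operator: integer shift of h by s*(u, v) with periodic borders."""
    return np.roll(h, (s * u, s * v), axis=(0, 1))

def clip_forward(h, A_sub, views, s):
    """Stack the readings A_k applied to the sheared image for every view k."""
    return np.concatenate([A_k @ shear(h, u, v, s).ravel()
                           for A_k, (u, v) in zip(A_sub, views)])

rng = np.random.default_rng(3)
n_a, N, mk = 2, 8, 6
views = [(u, v) for u in range(n_a) for v in range(n_a)]
A_sub = [rng.standard_normal((mk, N * N)) for _ in views]
h = rng.random((N, N))
f = clip_forward(h, A_sub, views, s=1)
```

Changing `s` re-targets the same measurement `f` to a different focal plane at reconstruction time, so only a single 2D image `h` has to be estimated for each focus setting.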
The computational complexity of compressive light field photography and CLIP depends on the light field resolution and the applied regularization method under the framework of regularization by denoising (see Methods). In CLIP, each iteration involves one pass of the forward operator and one pass of its transpose, along with a denoising step. The complexity of the shearing operation and of the measurement matrix is $O(n_a^2 N^2)$ and $O(mN^2)$ respectively, leading to a total complexity of $O((n_a^2 + m)N^2)$ for each of the forward and transpose passes. The complexity of BM3D and TV denoising for regularization is directly related to the image size as $O(kN^2)$, with k being a denoiser-dependent constant. Therefore, the total complexity of CLIP image recovery is $O((2m + 2n_a^2 + k)N^2)$ per iteration. In comparison, while the complexity of the forward and transpose passes in Supplementary Eq. 20 for retrieving the 4D light field remains $O(mN^2)$ owing to the block diagonal structure, the denoising complexity of a 4D light field becomes $O(k n_a^2 N^2)$, resulting in a total complexity of $O((2m + k n_a^2)N^2)$. Similarly, we can analyze the computational complexity per iteration for compressive light field imaging methods based on the models in Supplementary Eq. 17 to 19. Supplementary Table 1 summarizes the characteristics of CLIP and compressive light field photography. It is worth noting that the computational complexity of Marwah's work does not account for the dictionary learning process, and the regularization is applied on the entire light field. Also, the convolution model of the diffuser camera is accelerated by FFT.

Supplementary Note 8. CLIP 4D light field reconstruction versus direct reconstruction
While CLIP can recover a 4D light field as demonstrated in the previous Note, we show here that directly recovering a refocused image can better accommodate complex scenes, particularly for imaging with lower-dimensional (1D or 0D) sensors. Marwah's work relied on a dictionary learning process to obtain a representation basis that better sparsifies the 4D light field, thereby attaining excellent 4D light field reconstruction for complex scenes. On the other hand, Antipa 4 pointed out that improper regularization of the 4D light field in a diffuser-based camera can degrade (or even destroy) the angular information in the light field.
In contrast, CLIP does not rely on high-quality 4D light field reconstruction to obtain excellent refocused images: CLIP's complementary measurements among sub-apertures can significantly improve the refocused images even though the recovered 4D light field may not be of high quality, which is the case under the compressive regime. Further, CLIP can directly recover a refocused image, like coded-aperture and wavefront-coding methods, to accommodate complex scenes better, as explained in the previous section. We demonstrate this via a synthetic study for synthetic scene 2 and an experimentally acquired light field from the 'letter scene', using a sampling ratio of SR=1. During the reconstruction of the 4D light field, the regularization parameter is tuned to obtain the best refocused image from the light field data. Supplementary Figure 11 shows the recovered 4D light field and refocused images for the two scenes under the CLIP-1D (a and b) and CLIP-0D (c and d) implementations, with the NMSE listed in Supplementary Table 4. It is noted that while the light field suffers from significant background signals and noise, the refocusing process coherently assembles CLIP's complementary imaging across the sub-apertures to yield a substantially better refocused image. Moreover, CLIP's direct reconstruction further improves the quality of the refocused image by rendering more image details and a higher contrast.

Supplementary Figure 11. 4D light field reconstruction versus direct reconstruction of refocused images by CLIP. a-b, CLIP-1D reconstruction for the synthetic scene and the experimental 'letter' scene. c-d, CLIP-0D reconstruction for the two scenes. The sampling ratio of CLIP is fixed at SR=1.

5.
The performance of their method is not evaluated well. In compressed sensing, evaluation of data fidelity is essential. Without comparison of the reconstructed images with the ground truth measured by conventional methods (with a lower acquisition rate), it is impossible to judge if the method is good or not.

Response:
We thank the reviewer for pointing out the importance of quantitatively evaluating the imaging performance (data fidelity) of CLIP. In the revised manuscript, we quantified the accuracy of CLIP using the normalized mean square error (NMSE) with respect to the ground truth in both experiments and synthetic studies. We stressed this point in the last paragraph of the Principle section of the revised main text, which reads " … We quantified the efficacy of CLIP for light field imaging experimentally with a 0D sensor in Supplementary Note 7, and further evaluated the CLIP reconstruction accuracy synthetically with both 0D and 1D sensors in Supplementary Note 11, which employs CLIP to represent custom-acquired 4D light field data for scenes of different complexities and BRDF characteristics… ".
Detailed experimental quantification for CLIP imaging with 0D sensors (CLIP-0D) is given in the dedicated Supplementary Note 7 of the revised Supplementary Materials, and extensive synthetic quantification for both CLIP-0D and CLIP-1D (using a 1D array sensor) is summarized in Supplementary Table 7.

Supplementary Note 7. Quantitative evaluation of CLIP performance in experiments
We quantitatively evaluated the performance of CLIP via experimental measurements when feasible and turned to synthetic studies otherwise. This is because, for computational imaging employing nonlocal sampling strategies, ground truth data is typically difficult to obtain experimentally: a system reconfiguration with perfect alignment is necessary. Taking CLIP imaging with 1D sensors as an example, one needs to swap the cylindrical lenslet array for its spherical counterpart and add 1D scanning to obtain the ground-truth light field. This reference imaging system needs to be precisely realigned to show the same magnification and field of view as CLIP; any mismatch will otherwise bias the quantitative evaluation of its imaging accuracy.
For CLIP imaging with 0D sensors, the 4D light field can be fully sampled (though not with conventional 2D sensors): for each angular position behind the lens, the sub-aperture image can be acquired with a number of measurements equal to or larger than the image resolution (and thus does not rely on compressive sensing), and this imaging process is repeated at all angular positions. A CLIP measurement can be readily obtained from this dataset by extracting a small subset of measurements from each angular position and stacking the complementarily extracted data into a final measurement as described by Supplementary Eq. 20. We present experimental validation of CLIP with a 0D sensor in this section and synthetic evaluation of CLIP with a 1D sensor in the following sections.
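The extraction of a CLIP measurement from a fully sampled single-pixel dataset can be sketched as follows. The shared random binary codes, the twofold oversampling, and the least-squares reference reconstruction are assumptions made for this toy example rather than a description of the experimental processing.

```python
import numpy as np

rng = np.random.default_rng(4)
l, N = 4, 6                      # angular positions and image side (toy sizes)
n_pix = N * N
M_full = 2 * n_pix               # readings per view, at least the image resolution
C = rng.integers(0, 2, size=(M_full, n_pix)).astype(float)  # shared binary codes
P = rng.random((l, n_pix))       # stand-in for the true sub-aperture images
Y = P @ C.T                      # Y[k] holds all M_full readings at angular position k

# Reference light field: every view is solved from its complete measurement.
P_ref = np.linalg.lstsq(C, Y.T, rcond=None)[0].T

# CLIP measurement: keep a small, mutually disjoint subset of readings per view.
m = n_pix // 2                   # total budget, here SR = 0.5
subsets = np.array_split(rng.permutation(M_full)[:m], l)
f = np.concatenate([Y[k, idx] for k, idx in enumerate(subsets)])
A_sub = [C[idx] for idx in subsets]   # matching sub-measurement matrices
```

Because the reference and the CLIP data come from the same acquisition, no realignment is involved, and any reconstruction from `f` can be compared against `P_ref` directly.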
Two different scenes composed of printed letters were imaged by CLIP-0D experimentally, and both the 4D light field and direct image reconstructions are demonstrated under different sampling ratios SR. The ground truth 4D light field has a resolution of 4×4×128×128 and was obtained by reconstructing each sub-aperture image using a complete measurement. Similarly, the ground truth refocused image was obtained from the 4D light field. Supplementary Figures 8 and 9 show the 4D light field reconstruction results by CLIP for the two scenes, and the direct reconstructions of different refocused images are given in Supplementary Figure 10. The reconstruction error is quantified by NMSE in Supplementary Table 3. It is noted that both the 4D light field and the direct reconstruction of refocused images attained an NMSE below 10% in experiments.

6.
The authors' main claim "enable snapshot 3D imaging with an extended depth range and through severe scene occlusions" in the abstract is suspicious. In the supplementary movies, the shape of the objects significantly changes when they are occluded. Again, quantitative evaluation of image reconstruction is needed for supporting their claim.

Response:
We improved the experiments of imaging through occlusions. In the previous demonstration, the CLIP measurement across the lenslet array was not sufficiently randomized: the projection angles of the cylindrical lenslet array were uniformly spaced along the array direction. Under occlusions, the object only yields a subset of the full measurement entries. To maximize (in a statistical sense) the incoherence among the measurement subset at any time instant (the occlusion changes dynamically with object motion), it is best to randomly distribute the projection angles of the cylindrical lenslets along the array direction. This is similar in spirit to using a random subset of the Fourier basis for compressive single pixel imaging or MRI: the Fourier basis needs to be randomly shuffled first, and an arbitrary subset is then chosen from it.
Guided by this principle, we further randomized the cylindrical lens angles along the array direction in the new experiments, and improved the imaging quality for imaging through occlusions. Regarding this, we revised the Methods section to stress the randomized cylindrical lenslet arrangement that reads " … For optimal imaging through occlusions, the cylindrical lenslet angles are further randomly distributed along the lenslet array direction, as the effective measurement entries for the occluded objects are reduced to a subset of the measurement entry in the imaging model. Such random distribution maximizes statistically the incoherence among any subset of the measurements to ensure consistent image recovery performance…".
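The benefit of shuffling the lenslet angles can be illustrated with a small sketch. The number of lenslets, the contiguous visible window, and the use of the angular span as a crude proxy for the diversity of the surviving measurements are all assumptions for illustration, not quantities from the experiment.

```python
import numpy as np

def angular_span(angles):
    """Spread (max - min, in degrees) of the projection angles in a subset."""
    return float(np.max(angles) - np.min(angles))

l = 16
ordered = np.arange(l) * 180.0 / l     # angles increase monotonically along the array
rng = np.random.default_rng(5)
shuffled = rng.permutation(ordered)    # same angles, random positions along the array

# Suppose an occluder leaves only 4 neighbouring lenslets with a view of the object.
visible = slice(6, 10)
span_ordered = angular_span(ordered[visible])
span_shuffled = angular_span(shuffled[visible])
```

In the ordered layout any block of neighbouring lenslets covers the narrowest possible range of angles, whereas a shuffled layout can never do worse and usually covers a much wider range, which matches the reasoning above about random subsets.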
The improved imaging results (along with the corresponding Supplementary Video 1) are shown in the revised Fig. 2d in the main text, which is appended below. Furthermore, we also revised Supplementary Note 10 of Supplementary Materials to quantitatively evaluate the performance of CLIP in imaging through occlusions via synthetic studies, which show that CLIP can achieve a small error (mostly <10% NMSE) for seeing through occlusions with orders of magnitude less data. We clarified this point in the last sentence of the '3D imaging through Occlusions' section in the revised main text, which reads " … We further quantified the accuracy of CLIP imaging through occlusions via synthetic studies in Supplementary Note 10, which shows a small imaging error (<10%) can be obtained by CLIP despite a large reduction (>100 times) in light field measurement data… ".
We excerpted Supplementary Note 10 below for a detailed clarification.

" …
We further compare CLIP with conventional light field imaging for seeing through occlusions via synthetic studies. The 4D light fields for the 3D scenes were rendered in Blender with a resolution of 8×8×128×128, and CLIP measurements were obtained as in previous sections. Unlike ToF-based measurements, which can separate the signals of the occluder and the occluded objects in time, conventional imaging systems can only defocus the occluder, yielding significant background when visualizing the occluded objects. To emulate ToF measurement for minimizing background, the occluder can be made black in Blender such that its image signal is negligible in the generated light field. Supplementary Figure 14 shows four examples of imaging through occlusions: a mannequin standing behind a tree, the mannequin partially occluded by a black rectangular plate, the 'CLIP' letters placed behind a bush, and the 'CLIP' letters being blocked by a black rectangular occluder. The CLIP reconstruction NMSE values are shown in Supplementary Table 6. It is noted that even with a sampling ratio of SR=0.5, which corresponds to a 128-fold reduction of the 4D light field data, CLIP can effectively see through severe occlusions with an error below 10%. With ToF measurement that produces far sparser 2D instantaneous images and separates the occluder signal in time, as